Introduction

Diamond prices is can be predicted according to its specifications such as carat, cut, color and clarity. Since x-y-z variables mostly defines carat they might be behaving as the same way as carat did. We can observe the correlation of each variable with price on the ggpairs plot. Some of the variables are categorical, in order to see the effect them on price can be seen after converting them into numeric variables. But plotting them as color identity in carat-price plot may give some clue about their behavior.

Loading Libraries and Data

library(vtreat)
library(GGally)
library(gridExtra)
library(tidyverse)
library(randomForest)
library(Metrics)
set.seed(503)

diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

Overview and Exploratory Data Analysis

Lets have a look at train data:

glimpse(diamonds_train)
## Rows: 43,143
## Columns: 11
## $ carat      <dbl> 0.21, 0.23, 0.29, 0.31, 0.24, 0.26, 0.22, 0.23, 0.30, 0.23…
## $ cut        <ord> Premium, Good, Premium, Good, Very Good, Very Good, Fair, …
## $ color      <ord> E, E, I, J, I, H, E, H, J, J, F, J, E, I, J, J, J, H, J, G…
## $ clarity    <ord> SI1, VS1, VS2, SI2, VVS1, SI1, VS2, VS1, SI1, VS1, SI1, SI…
## $ depth      <dbl> 59.8, 56.9, 62.4, 63.3, 62.3, 61.9, 65.1, 59.4, 64.0, 62.8…
## $ table      <dbl> 61, 65, 58, 58, 57, 55, 61, 61, 55, 56, 61, 54, 62, 54, 54…
## $ price      <int> 326, 327, 334, 335, 336, 337, 337, 338, 339, 340, 342, 344…
## $ x          <dbl> 3.89, 4.05, 4.20, 4.34, 3.95, 4.07, 3.87, 4.00, 4.25, 3.93…
## $ y          <dbl> 3.84, 4.07, 4.23, 4.35, 3.98, 4.11, 3.78, 4.05, 4.28, 3.90…
## $ z          <dbl> 2.31, 2.31, 2.63, 2.75, 2.47, 2.53, 2.49, 2.39, 2.73, 2.46…
## $ diamond_id <int> 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 2…
summary(diamonds_train)
##      carat               cut        color       clarity          depth      
##  Min.   :0.2000   Fair     : 1285   D:5416   SI1    :10449   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 3923   E:7835   VS2    : 9806   1st Qu.:61.00  
##  Median :0.7000   Very Good: 9662   F:7629   SI2    : 7354   Median :61.80  
##  Mean   :0.7985   Premium  :11036   G:9037   VS1    : 6538   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :17237   H:6646   VVS2   : 4052   3rd Qu.:62.50  
##  Max.   :5.0100                     I:4336   VVS1   : 2923   Max.   :79.00  
##                                     J:2244   (Other): 2021                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2403   Median : 5.700   Median : 5.710  
##  Mean   :57.45   Mean   : 3939   Mean   : 5.732   Mean   : 5.736  
##  3rd Qu.:59.00   3rd Qu.: 5352   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18818   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z            diamond_id   
##  Min.   : 0.000   Min.   :    2  
##  1st Qu.: 2.910   1st Qu.:13586  
##  Median : 3.520   Median :26991  
##  Mean   : 3.539   Mean   :26995  
##  3rd Qu.: 4.040   3rd Qu.:40454  
##  Max.   :31.800   Max.   :53940  
## 
diamonds_train %>% summarise_all(funs(sum(is.na(.)))) # is there any na values? 
## # A tibble: 1 x 11
##   carat   cut color clarity depth table price     x     y     z diamond_id
##   <int> <int> <int>   <int> <int> <int> <int> <int> <int> <int>      <int>
## 1     0     0     0       0     0     0     0     0     0     0          0

There is no missing values in the data. ### Exploring distribution of variables Since there are many variables, I will use the scatter plot matrix to get a quick and easy view of the distribution and correlation of different variables.

In order to keep it simple first investigate quantitative variables.

ggpairs(diamonds_train, columns = c(1,5,6,7),  aes(alpha =0.5), title = 'Quantitative Variables vs Price')

ggpairs(diamonds_train, columns = c(2,3,4,7,8,9,10),  aes(alpha =0.5),title = 'Qualitative Variables vs Price')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now we will take a detailed look at some of the relationships. I will use scaling for Price.

p1 <- ggplot(diamonds_train, aes(x = carat, y = price, color = cut)) + 
    geom_point(alpha = 0.3) + scale_y_sqrt()
p2 <- ggplot(diamonds_train, aes(x = carat, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt() #to see how it doing with square root. 
p3 <- ggplot(diamonds_train, aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.3) + scale_y_sqrt() 
p4 <- ggplot(diamonds_train, aes(x = x, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p5 <- ggplot(diamonds_train, aes(x = y, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt() 
p6 <- ggplot(diamonds_train, aes(x = z, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt()
grid.arrange(p1, p2, p3, p4, p5, p6, nrow= 3, ncol = 2)

First we look the price and carat relationship since it has most correlation. While doing this we add cut variable as color argument in the plot and take the root of price.

p1 <- ggplot(diamonds_train, aes(x = carat, y = price, color=cut)) + 
    geom_point(alpha = 0.3) + scale_y_sqrt()
p1

p2 <- ggplot(diamonds_train, aes(x = carat, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt() #to see how it doing with square root.
p2

p3 <- ggplot(diamonds_train, aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p3

p4 <- ggplot(diamonds_train, aes(x = x, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p4

Models

Firstly we try default value of price in model1:

model1 <- lm(price ~ ., data=diamonds_train[1:10])

summary(model1)
## 
## Call:
## lm(formula = price ~ ., data = diamonds_train[1:10])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21446.0   -593.0   -182.4    378.8  10701.8 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6032.795    440.677   13.690  < 2e-16 ***
## carat       11306.047     55.032  205.445  < 2e-16 ***
## cut.L         587.057     25.144   23.347  < 2e-16 ***
## cut.Q        -300.964     20.114  -14.963  < 2e-16 ***
## cut.C         148.352     17.317    8.567  < 2e-16 ***
## cut^4         -26.568     13.826   -1.922   0.0547 .  
## color.L     -1949.838     19.404 -100.484  < 2e-16 ***
## color.Q      -672.931     17.640  -38.148  < 2e-16 ***
## color.C      -161.554     16.458   -9.816  < 2e-16 ***
## color^4        25.973     15.117    1.718   0.0858 .  
## color^5       -98.819     14.278   -6.921 4.55e-12 ***
## color^6       -59.129     12.977   -4.556 5.22e-06 ***
## clarity.L    4129.825     33.880  121.897  < 2e-16 ***
## clarity.Q   -1954.978     31.602  -61.863  < 2e-16 ***
## clarity.C     997.867     27.030   36.918  < 2e-16 ***
## clarity^4    -382.021     21.572  -17.709  < 2e-16 ***
## clarity^5     242.941     17.611   13.795  < 2e-16 ***
## clarity^6      12.562     15.328    0.820   0.4125    
## clarity^7      87.466     13.525    6.467 1.01e-10 ***
## depth         -65.123      4.992  -13.045  < 2e-16 ***
## table         -29.084      3.250   -8.950  < 2e-16 ***
## x           -1020.664     34.775  -29.350  < 2e-16 ***
## y              -1.438     19.396   -0.074   0.9409    
## z             -38.006     33.858   -1.123   0.2616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 43119 degrees of freedom
## Multiple R-squared:  0.9202, Adjusted R-squared:  0.9202 
## F-statistic: 2.162e+04 on 23 and 43119 DF,  p-value: < 2.2e-16

Then if we take log of price it seems more straight and suitable for regression:

model2 <- lm(I(log(price)) ~ ., data = diamonds_train[1:10])
summary(model2)
## 
## Call:
## lm(formula = I(log(price)) ~ ., data = diamonds_train[1:10])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3119 -0.0899  0.0008  0.0894  8.8470 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.7768048  0.0666391  -41.669  < 2e-16 ***
## carat       -0.6806057  0.0083219  -81.785  < 2e-16 ***
## cut.L        0.1047956  0.0038023   27.561  < 2e-16 ***
## cut.Q       -0.0368615  0.0030416  -12.119  < 2e-16 ***
## cut.C        0.0364380  0.0026187   13.915  < 2e-16 ***
## cut^4        0.0125579  0.0020907    6.007 1.91e-09 ***
## color.L     -0.4518299  0.0029343 -153.981  < 2e-16 ***
## color.Q     -0.1029256  0.0026675  -38.585  < 2e-16 ***
## color.C     -0.0123329  0.0024888   -4.955 7.25e-07 ***
## color^4      0.0175279  0.0022860    7.667 1.79e-14 ***
## color^5     -0.0081027  0.0021592   -3.753 0.000175 ***
## color^6      0.0020423  0.0019624    1.041 0.298013    
## clarity.L    0.8926304  0.0051233  174.230  < 2e-16 ***
## clarity.Q   -0.2572644  0.0047788  -53.834  < 2e-16 ***
## clarity.C    0.1421625  0.0040874   34.781  < 2e-16 ***
## clarity^4   -0.0668565  0.0032621  -20.495  < 2e-16 ***
## clarity^5    0.0287787  0.0026632   10.806  < 2e-16 ***
## clarity^6   -0.0039185  0.0023179   -1.691 0.090933 .  
## clarity^7    0.0284550  0.0020452   13.913  < 2e-16 ***
## depth        0.0537834  0.0007549   71.241  < 2e-16 ***
## table        0.0092119  0.0004914   18.745  < 2e-16 ***
## x            1.1979141  0.0052587  227.795  < 2e-16 ***
## y            0.0306626  0.0029331   10.454  < 2e-16 ***
## z            0.0388247  0.0051200    7.583 3.44e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1708 on 43119 degrees of freedom
## Multiple R-squared:  0.9717, Adjusted R-squared:  0.9717 
## F-statistic: 6.445e+04 on 23 and 43119 DF,  p-value: < 2.2e-16

Lastly we try another model including random forest method:

model3 <-  randomForest(
    price ~ .,
    data= diamonds_train[-11]
  )
summary(model3)
##                 Length Class  Mode     
## call                3  -none- call     
## type                1  -none- character
## predicted       43143  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times       43143  -none- numeric  
## importance          9  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y               43143  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call

Evaluation of Models

test_df <- diamonds_test %>% select(-diamond_id) # To match the model dimension which we obtained without diamond_id
pred1 = predict(model1, newdata=test_df[-7])
rmse1 <- rmse(test_df$price, pred1) 
rmse1
## [1] 1132.834
pred2 = predict(model2, newdata=test_df[-7]) 
rmse2 <- rmse(test_df$price, pred2) 
rmse2
## [1] 5556.116
pred3 = predict(model3, newdata=test_df[-7]) 
rmse3 <- rmse(test_df$price, pred3) 
rmse3
## [1] 561.9026

According to the results, model3 is the most successful with the lowest rmse value = 561.9 We can use in order to predict diamond prices with model3.

References